Project: Credit Card Users Churn Prediction

Data Dictionary:

Load the Dataset

Observations:

Based on the unique values for each column, we have:

The following continuous columns:
  1. CLIENTNUM
  2. Customer_Age
  3. Months_on_book
  4. Credit_Limit
  5. Total_Revolving_Bal
  6. Avg_Open_To_Buy
  7. Total_Amt_Chng_Q4_Q1
  8. Total_Trans_Amt
  9. Total_Trans_Ct
  10. Total_Ct_Chng_Q4_Q1
  11. Avg_Utilization_Ratio
The following categorical columns:
  1. Attrition_Flag
  2. Gender
  3. Dependent_count
  4. Education_Level
  5. Marital_Status
  6. Income_Category
  7. Card_Category
  8. Total_Relationship_Count
  9. Months_Inactive_12_mon
  10. Contacts_Count_12_mon
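
The split above is based on the number of unique values per column. A minimal sketch of that idea follows, assuming a pandas DataFrame; the tiny frame, the helper name `split_by_cardinality` and the `max_levels` threshold are hypothetical and only illustrate the approach.

```python
import pandas as pd

# Hypothetical miniature frame; the real notebook loads the full bank dataset.
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40, 44, 37, 48, 56],
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0, 4010.0, 29081.0, 22352.0],
    "Gender": ["M", "F", "M", "F", "M", "M", "M", "F"],
    "Card_Category": ["Blue", "Blue", "Blue", "Blue", "Blue", "Blue", "Silver", "Blue"],
})

def split_by_cardinality(frame, max_levels=3):
    """Treat columns with at most `max_levels` distinct values as categorical."""
    n_unique = frame.nunique()
    categorical = n_unique[n_unique <= max_levels].index.tolist()
    continuous = n_unique[n_unique > max_levels].index.tolist()
    return continuous, categorical

continuous_cols, categorical_cols = split_by_cardinality(df)
print("continuous:", continuous_cols)
print("categorical:", categorical_cols)
```

On the full dataset the threshold would need to be chosen after inspecting `df.nunique()`, and an identifier such as CLIENTNUM would still need to be handled by hand.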

Data Summary and Distribution Overview

Observations:

  1. Attrition_Flag has 2 unique values: Existing Customer (83.9%) and Attrited Customer (16.1%). This is the dependent variable, and its classes are imbalanced.

  2. Gender has 2 unique values: "F" (52.9%) and "M" (47.1%). The share of female customers is slightly higher than that of male customers, but the classes are not highly imbalanced.

  3. Dependent_count has 6 unique values: each customer has 0, 1, 2, 3, 4 or 5 dependents. 8.9% of customers have 0 dependents, 18.1% have 1, 26.2% have 2, 26.9% have 3, 15.5% have 4 and 4.2% have 5.

  4. Education_Level: most customers hold a Graduate degree, followed by High School graduates. Very few customers have a Doctorate or a Post-Graduate degree. Some customers are still in College, and for 1519 customers the Education Level is Unknown.

  5. Marital_Status: most of the bank's customers are Married, followed by 3943 Single customers. The Marital Status is Unknown for 749 customers, almost the same number as are Divorced (748).

  6. Income_Category: the largest group of customers (3561) earns less than 40K annually, followed by 1790 earning 40K-60K, 1535 earning 80K-120K, 1402 earning 60K-80K and 727 earning over 120K. For 1112 customers the annual income is Unknown.

  7. Card_Category: 9436 customers have a "Blue" credit card, 555 have "Silver" and 20 have "Platinum"; the remaining customers (about 1.1%) hold "Gold". "Blue" looks like a starter credit card and is the one most customers have.

  8. Total_Relationship_Count: 2305 customers have 3 products from the bank, 1912 have 4, 1891 have 5, 1866 have 6, 1243 have 2 and 910 have just 1 product.

  9. Months_Inactive_12_mon: only 29 customers have 0 inactive months, meaning that they use their credit card regularly. 2233 customers have not used their credit card for 1 month, 3282 for around 2 months, 3846 for around 3 months, 435 for around 4 months, 178 for 5 months and 124 for around 6 months.

  10. Contacts_Count_12_mon: most customers contacted the bank 3 times (3380) or 2 times (3227) in the last 12 months, while very few contacted it 5 times (176) or 6 times (54). This suggests that the bank probably has online self-service for its products and that most customers manage their accounts themselves.

Observations:

  1. Customer_Age is almost normally distributed with a slight right skew. The range is from 26 to 73 years.

  2. Dependent_count is approximately normally distributed. The mean is 2.3, but since a customer cannot have 2.3 dependents we can round it to 2. Customers have 0-5 dependents.

  3. The average Period of relationship with the bank (Months_on_book) is 35.9 months which is almost equal to median value showing that distribution is almost normal.

  4. Total_Relationship_Count shows that customers have 1-6 products from the bank. The median is 4 products, while the most common value is 3.

  5. Credit_Limit ranges from 1438 to 34516 (unit unknown). The mean credit limit is 8631 and the median is 4549, so this feature is right skewed rather than normally distributed.

  6. The mean Total_Revolving_Bal is 1163, which is the balance customers carry over from month to month on average. The mean is slightly lower than the median of 1276, showing that this feature is not normally distributed.

  7. Avg_Open_To_Buy is the amount left on the credit card to use (average of the last 12 months). The mean across all customers is 7469, well above the median of 3474. The range is from 3 to 34516, which is large: some customers hardly use their credit card, while others use almost the entire credit limit.

  8. Total_Amt_Chng_Q4_Q1: this ratio ranges from 0 to 3.4, which is a large range. The mean of 0.76 shows that, on average, a customer's total transaction amount in Q4 is lower than in Q1. We can investigate further to see whether we can find the reason for this change.

  9. Total_Ct_Chng_Q4_Q1: this feature ranges from 0 to 3.7. The mean and median are both 0.7, again below 1, indicating that on average customers made fewer transactions in Q4 than in Q1. This is in line with what we saw for Total_Amt_Chng_Q4_Q1.

  10. Total_Trans_Amt is the total transaction amount in the last 12 months. It ranges from 510 to 18484, a large range, with a mean of 4404 and a median of 3899. The mean and median are fairly close to each other, but the maximum is far from both, suggesting that there may be a few outliers.

  11. Total_Trans_Ct is the total transaction count in the last 12 months. It ranges from 10 to 139 transactions, again a large range that shows varied usage among customers. The mean is about 64 transactions; there are a few outliers that we will study further, and the distribution is not perfectly normal.

  12. Avg_Utilization_Ratio ranges from 0 to 0.99. The mean is about 0.3 and exceeds the median, which shows that this feature is right skewed rather than normally distributed.
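
Several of the observations above infer skew by comparing the mean with the median. A minimal sketch of that check, assuming pandas; the helper `skew_direction` and the sample values are hypothetical, chosen only so that the minimum, median and maximum match the Credit_Limit figures quoted above.

```python
import pandas as pd

def skew_direction(series):
    """Rule of thumb: mean above median suggests a right tail, below a left tail."""
    mean, median = series.mean(), series.median()
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none"

# Hypothetical sample, not the real column.
credit_limit = pd.Series([1438.0, 2500.0, 3200.0, 4549.0, 6000.0, 9000.0, 34516.0])

print("mean:", round(credit_limit.mean(), 1), "median:", credit_limit.median())
print("direction:", skew_direction(credit_limit))
print("sample skewness:", round(credit_limit.skew(), 2))
```

The mean-versus-median comparison is only a heuristic; `Series.skew()` gives a direct estimate and is worth checking alongside the histogram.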

Exploratory Data Analysis

Univariate

Observations:

  1. The mean and median are approximately equal, and Customer_Age has a roughly normal distribution.
  2. There are 2 outliers (age > 70) in the distribution, but we will not remove them because they represent a real-world scenario: a bank can always have customers who are over 70 years old.
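
The outliers discussed in this section are the points a boxplot draws beyond its whiskers. A minimal sketch of that rule, assuming the plots use the usual 1.5 x IQR whiskers (the notebook's plotting settings are not shown); the helper `iqr_outliers` and the sample ages are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical ages, not the real column.
ages = pd.Series([30, 38, 41, 44, 46, 46, 48, 50, 52, 55, 73])
print(iqr_outliers(ages).tolist())
```

Flagging a value this way does not mean it should be removed; as noted above, such customers are kept because they are plausible real-world cases.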

Observations:

  1. There are outliers beyond both the upper and lower whiskers, showing that a few customers have a very long relationship with the bank and a few are new clients.
  2. The mean and median are very close, and the feature has a roughly normal distribution with a high spike at the median.

Observations:

  1. Credit_Limit has a lot of outliers; some customers have a much higher credit limit than most.
  2. We will not treat the outliers because they represent real-world data: a bank can offer a higher credit limit to long-standing customers who repay their credit and might have other deposits with the bank. The outliers could also be Business Accounts, but this is an assumption, as nothing about it is mentioned in the data details.
  3. The mean is greater than the median, and the distribution is right skewed with a long tail to the right.

Observations:

  1. The mean of Total_Revolving_Bal is lower than the median.
  2. The distribution has a longer tail towards the left and is negatively skewed.
  3. There are no outliers in the distribution.
  4. There are 2471 customers with a Total_Revolving_Bal of 0: these customers repay the full billed amount every month and do not carry it over to the next month.

Observations:

  1. The mean is greater than the median, and the distribution is right skewed.
  2. There are many outliers, but we will not treat them because they represent real-world values: a few customers may either not use their credit card at all or use it very little, and hence have most of their credit limit left to spend.

Observations:

  1. The mean and median are almost the same for Total_Amt_Chng_Q4_Q1, and the feature is roughly normally distributed.
  2. There are a few outliers beyond both the lower and upper whiskers: for a few customers the ratio of the Q4 total transaction amount to the Q1 total transaction amount was much lower or higher than the median. We will not treat these outliers, as they represent real-world data.

Observations:

  1. There are a couple of peaks in the distribution of Total_Trans_Amt.
  2. The mean is higher than the median, which reflects the skew in the data. The distribution has a longer right tail.
  3. There are several outliers, indicating that some customers use their credit card much more than others. We will not treat these outliers, as they represent real-world data.
  4. The distribution is uneven and has some small peaks, but since this reflects the real data we will not treat these.

Observations:

  1. Total_Trans_Ct is roughly normally distributed with a few small peaks.
  2. There is a small tail towards the right, showing that this feature is slightly skewed.
  3. There are a couple of outliers, but we will not treat them, as some customers may use their credit card more than the average customer.

Observations:

  1. Total_Ct_Chng_Q4_Q1 is roughly normally distributed.
  2. There are many outliers, which show that some customers made more transactions in Q4 than in Q1. This is a plausible scenario and represents real-world data, so we won't treat these.

Observations:

  1. The mean is higher than median for Avg_Utilization_Ratio, the data for this feature is not normally distributed and is right skewed.
  2. There are no outliers with this data.

Observations:

  1. Attrition_Flag has imbalanced classes.
  2. 16.1% of customers have closed their credit card account with the bank and 83.9% are existing customers.

Observations:

  1. 52.9% of the bank's customers are female and 47.1% are male.
  2. This is a good mix, and we can say that the Gender feature is almost balanced.

Observations:

  1. Dependent_count represents the number of dependents for each customer.
  2. The range for this feature is 0-5.
  3. Most customers have either 2 or 3 dependents, followed by either 1 or 4 dependents. Very few customers have 5 dependents, and almost 9% of customers have zero dependents.

Observations:

  1. The most common Education Level among the bank's customers is Graduate (almost 31%), followed by High School.
  2. Almost equal shares of customers are either Uneducated (14.7%; probably never went to school or didn't complete High School) or have an Unknown Education Level (15%).
  3. A very small share of customers are Post-Graduates (5.1%) or hold a Doctorate (4.5%).
  4. Around 10% of customers are College students.
  5. Based on this mix, it looks like the bank offers credit cards to people (High School/College) who have either no earnings or small earnings, probably with a small credit limit.
  6. Since different credit cards are available, we will further explore which kind of credit card each customer segment prefers.

Observations:

  1. Most of the customers are Married, followed by Single customers.
  2. The percentages of Divorced customers and customers with Unknown marital status are almost the same.
  3. We will further explore whether marital status has any bearing on attrition.

Observations:

  1. The largest group of customers (35%) makes less than 40K; 18% make between 40K and 60K.
  2. Around 14-15% each make 60K-80K or 80K-120K.
  3. A very small percentage of customers (7%) make more than 120K, and for 1112 customers (about 11%) this information is Unknown.
  4. We might explore whether Education Level and Income are correlated in this data set. Since most of the customers are Graduates, followed by almost 19% High Schoolers, some of those customers might be making less than 40K per year; we will explore this further.

Observations:

  1. Out of the 4 types of credit cards offered by the bank, the "Blue" credit card is the most popular one which is used by 93% of the customers.
  2. Only 5.5% customers use the Silver Credit Card, followed by 1.1% Gold Credit Card holders.
  3. Very negligible number of customers (0.2%) use the Platinum credit card.

Observations:

  1. The largest group of customers (23%) uses 3 products with the bank; besides the credit card, these could be a savings account or any other type of account.
  2. Almost the same percentage of customers (18-19% each) use 4, 5 or 6 bank products.
  3. We will further explore whether the total number of products used is related to the Attrition_Flag.

Observations:

  1. 38% of customers have been inactive for 3 months (in the last 12 months), followed by 32.4% for 2 months and 22% for 1 month.
  2. Very few customers have been inactive for 0, 4, 5 or 6 months.
  3. There is no information on why the customers were inactive, but we will further explore whether attrition is related to customer inactivity.

Observations:

  1. Most of the customers contacted the bank 2 or 3 times in the last 12 months, followed by customers who contacted the bank 1 or 4 times.
  2. This suggests that the bank might have online self-service available, which customers use effectively and which means they do not need to contact the bank very often.

Exploratory Data Analysis

Multivariate Correlation and Bivariate Analysis

Let's try to compare features with respect to each other, we will try to focus on these relationships:

But first, let's pairplot the features

Observations:

  1. Customer_Age has a high positive correlation with Months_on_book.
  2. Total_Trans_Ct has a high correlation with Total_Trans_Amt: more transactions imply a larger transaction amount.
  3. Total_Revolving_Bal has a high correlation with Avg_Utilization_Ratio, which suggests that people who use a large share of their available credit tend to carry the balance over from month to month.
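
Correlations like these are typically read off a correlation matrix or heatmap. A minimal sketch of extracting the strongly correlated pairs, assuming pandas and Pearson correlation; the helper `strong_pairs`, the 0.7 threshold and the six-row frame are hypothetical.

```python
import pandas as pd

# Hypothetical rows, constructed so that two pairs of columns move together.
df = pd.DataFrame({
    "Customer_Age": [30, 38, 45, 52, 60, 66],
    "Months_on_book": [20, 28, 36, 40, 50, 54],
    "Total_Trans_Ct": [60, 20, 90, 35, 75, 42],
    "Total_Trans_Amt": [3900, 900, 8000, 1500, 4800, 2100],
})

def strong_pairs(frame, threshold=0.7):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs[(cols[i], cols[j])] = round(float(r), 2)
    return pairs

print(strong_pairs(df))
```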

Observations:

  1. Most of the Attrited customers contacted the bank 3 times in the last 12 months, followed by those who contacted it 2 times.
  2. An almost similar pattern can be seen for Existing customers, except that most of them contacted the bank 2 times, followed by 3 times, in the last 12 months.

Observations:

  1. Most of the customers who attrited were inactive for 3 months in the last 12 months, followed by 2 months of inactivity.
  2. Some Existing customers have also been inactive for 3 months, followed by 2 months of inactivity. There are 2133 Existing customers who have been inactive for 1 month.

Observations:

  1. Most of the Attrited customers have a total of 3 relationships with the bank (they use 3 of its products), followed by 2 and 1 relationships. A single relationship suggests that the customer might only have the credit card with the bank.
  2. Most of the Existing customers have 3 relationships with the bank, followed by almost equal numbers of customers who have 4, 5 or 6 relationships.

Observations:

  1. In both the Attrited and Existing customer categories, the Blue card is the most popular one.

Observations:

  1. Most of the Attrited and Existing customers make less than 40K.
  2. In both categories, Existing and Attrited, the next highest Income_Category is 40K-60K.

Observations:

  1. The plot shows that Attrited customers mostly have a lower Total_Revolving_Bal than Existing customers.
  2. Some Attrited customers have a Total_Revolving_Bal of 800-1400, which overlaps with Existing customers.

Observations:

  1. A very similar pattern can be seen in this feature for both Existing and Attrited customers.
  2. Most of the Existing and Attrited customers have had a relationship with the bank for 31-40 months. There are outliers in both groups, i.e. customers whose relationship is shorter than 20 months or longer than 52 months.

Observations:

  1. Avg_Open_To_Buy: most customers (both Existing and Attrited) have 100-10000 left on the credit card to use.
  2. Most of the Attrited customers have a Total_Trans_Ct in the range 38-50, whereas most of the Existing customers have a Total_Trans_Ct of 50-80.
  3. 50% of the Attrited customers have a Total_Trans_Amt of 2500-3500, whereas for Existing customers it is 3000-5000.
  4. Most of the Attrited customers have an Avg_Utilization_Ratio of 0-0.2, whereas for Existing customers it is 0.05-0.5.

Observations:

  1. Credit_Limit is highest for Platinum Credit Card, most of the customers have 33000-35000 credit limit for platinum card.
  2. For Silver Credit card, most popular credit limit is 15000-35000, for Gold credit card, the credit limit range is 23500-35000 and Blue credit card has lowest credit limit range 2500-8000.

Observations:

  1. Since most of the customers use the Blue credit card, most of the data is for that card; the next most used one is the Silver credit card.
  2. Most of the Blue credit card holders earn less than 40K, followed by customers who earn 40K-60K.
  3. A similar pattern can be seen for the Silver credit card.
  4. For the Gold and Platinum credit cards it is difficult to comment, as there is not enough data.

Observations:

  1. Avg_Utilization_Ratio is highest for the Blue credit card, which shows that Blue card holders use a larger share of their credit limit than customers who have a Gold, Silver or Platinum credit card.
  2. Total_Trans_Amt is highest for the Platinum credit card, which makes sense as the Credit_Limit is also highest for the Platinum credit card.
  3. Total_Revolving_Bal is very similar across the Blue, Gold, Silver and Platinum credit cards.
  4. Total_Trans_Ct is also higher for the Platinum credit card than for the Blue credit card.
  5. Avg_Open_To_Buy (the amount left on the credit card to use) is very low for the Blue credit card compared to the Platinum, Silver or Gold credit cards.

Data Preparation

  1. We have seen that a few categorical columns have "Unknown" values; these represent missing values in the dataset. We will impute these missing values with the KNN imputer. The features of interest that have "Unknown" values are:

    • Education_Level
    • Marital_Status
    • Income_Category
  2. A few continuous features that are ratios have a value of "0", which we take to mean that the numerator of the ratio was recorded as zero in place of a null value. On that assumption we also treat these zeros as missing values and impute them with the KNN imputer. The features of interest that have a ratio value of "0" are:

    • Total_Amt_Chng_Q4_Q1
    • Total_Ct_Chng_Q4_Q1
    • Avg_Utilization_Ratio
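
Before imputation, the placeholder values listed above have to be turned into real missing values. A minimal sketch, assuming pandas and NumPy; the three-row frame and its category labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; each column contains exactly one placeholder value.
df = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School"],
    "Marital_Status": ["Married", "Single", "Unknown"],
    "Income_Category": ["Unknown", "Less than 40K", "40K - 60K"],
    "Total_Amt_Chng_Q4_Q1": [0.76, 0.0, 1.2],
    "Total_Ct_Chng_Q4_Q1": [0.7, 0.65, 0.0],
    "Avg_Utilization_Ratio": [0.0, 0.3, 0.5],
})

cat_cols = ["Education_Level", "Marital_Status", "Income_Category"]
ratio_cols = ["Total_Amt_Chng_Q4_Q1", "Total_Ct_Chng_Q4_Q1", "Avg_Utilization_Ratio"]

# "Unknown" in the categorical columns and 0 in the ratio columns become NaN.
df[cat_cols] = df[cat_cols].replace("Unknown", np.nan)
df[ratio_cols] = df[ratio_cols].replace(0, np.nan)

print(df.isna().sum())
```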

Observations:

We have to impute values for following categorical columns:

- Education_Level             
- Marital_Status               
- Income_Category             

and following numerical columns:

- Total_Amt_Chng_Q4_Q1           
- Total_Ct_Chng_Q4_Q1            
- Avg_Utilization_Ratio       

Split dataset in Train and Test sets
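
A minimal sketch of the split, assuming scikit-learn's `train_test_split`. The 75/25 ratio, the `stratify` argument and the synthetic arrays are assumptions for illustration; the notebook's actual settings are not shown. Stratifying keeps the roughly 16% share of attrited customers the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 customers, 16 of them attrited (label 1).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 16 + [0] * 84)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

print("train:", X_train.shape, "attrited share:", y_train.mean())
print("test: ", X_test.shape, "attrited share:", y_test.mean())
```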

Imputing Missing Values

We will use the KNN imputer to impute missing values, with the default n_neighbors=5.
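
A minimal sketch of this step, assuming scikit-learn's `KNNImputer` and purely numeric input (categorical columns would first need a numeric encoding, which is not shown). The arrays are hypothetical. The imputer is fitted on the training set only and then applied to the test set, so no information leaks from test to train.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric features with one missing value in each set.
X_train = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, 10.0],
    [6.0, 12.0],
])
X_test = np.array([[3.5, np.nan]])

imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)  # learn neighbours from train only
X_test_imp = imputer.transform(X_test)        # reuse the fitted imputer on test

print(X_train_imp)
print(X_test_imp)
```

With only five complete rows available, every missing entry here is filled with the mean of the same five donor values.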

Encoding categorical variables
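
A minimal sketch of one common approach, one-hot encoding with `pandas.get_dummies`; whether the notebook used this or another encoder is an assumption. The small frames are hypothetical. The test set is re-indexed to the training columns so that both sets end up with the same features.

```python
import pandas as pd

# Hypothetical train and test rows.
X_train = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Card_Category": ["Blue", "Silver", "Blue"],
    "Customer_Age": [45, 49, 51],
})
X_test = pd.DataFrame({
    "Gender": ["M"],
    "Card_Category": ["Blue"],
    "Customer_Age": [40],
})

# drop_first avoids a redundant dummy column per categorical feature.
X_train_enc = pd.get_dummies(X_train, drop_first=True, dtype=int)

# Encode the test set fully, then keep exactly the training columns;
# categories absent from the test set are filled with 0.
X_test_enc = pd.get_dummies(X_test, dtype=int).reindex(
    columns=X_train_enc.columns, fill_value=0
)

print(X_train_enc)
print(X_test_enc)
```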

Building the model

Model evaluation criterion:

Model can make wrong predictions as:

  1. Predicting that a customer will attrite when the customer doesn't attrite: loss of the resources the bank spends on improving services for this customer.

  2. Predicting that a customer will not attrite when the customer does attrite: loss of clientele when the customer renounces their credit card.

Which case is more important?

How do we reduce the loss of customers, i.e. how do we reduce False Negatives?
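
Reducing False Negatives corresponds to raising recall, since recall = TP / (TP + FN). A minimal sketch with scikit-learn, assuming label 1 means an attrited customer; the label vectors are hypothetical.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = attrited customer, 0 = existing customer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)

print("false negatives:", fn)
print("recall:", recall, "==", recall_score(y_true, y_pred))
```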

Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score
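
A minimal sketch of this evaluation, assuming scikit-learn and recall as the scoring metric. The synthetic data from `make_classification`, the five folds and the `max_iter` value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training set, with about 16% positives.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.84, 0.16], random_state=1
)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# One recall score per fold; the spread shows how stable the model is.
scores = cross_val_score(model, X, y, scoring="recall", cv=kfold)
print("recall per fold:", scores.round(3))
print("mean recall:", scores.mean().round(3))
```

With imbalanced classes, `StratifiedKFold` is a common alternative to `KFold` because it keeps the class ratio the same in every fold.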

Oversampling train data using SMOTE
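
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The notebook presumably relies on a library implementation; the NumPy function below is only a simplified sketch of the idea, and `smote_sketch` and the random data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per point
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical imbalanced training data: 84 majority rows, 16 minority rows.
data_rng = np.random.default_rng(1)
X_maj = data_rng.normal(size=(84, 2))
X_min = data_rng.normal(loc=2.0, size=(16, 2))

synthetic = smote_sketch(X_min, n_new=len(X_maj) - len(X_min))
X_bal = np.vstack([X_maj, X_min, synthetic])
y_bal = np.array([0] * len(X_maj) + [1] * (len(X_min) + len(synthetic)))

print("class counts after oversampling:", np.bincount(y_bal))
```

Oversampling is applied to the training data only; the test set keeps its original class balance.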

Logistic Regression on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

Regularization

Undersampling train data
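
SMOTE itself only creates synthetic minority samples; reducing the majority class is usually done by random undersampling. Which technique the notebook used is not shown, so the NumPy function below is a sketch of random undersampling under that assumption, with a hypothetical helper name and data.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Hypothetical imbalanced training data: 84 existing (0), 16 attrited (1).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 84 + [1] * 16)

X_under, y_under = random_undersample(X, y)
print("class counts after undersampling:", np.bincount(y_under))
```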

Logistic Regression on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds
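
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio, and (odds - 1) * 100 gives the percentage change in odds per unit increase of the feature. A minimal sketch with hypothetical coefficient values:

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, not the fitted values from the notebook.
coefs = pd.Series({"Total_Trans_Ct": -0.05, "Contacts_Count_12_mon": 0.40})

odds = np.exp(coefs)            # odds ratio per one-unit increase
pct_change = (odds - 1) * 100   # percentage change in the odds of attrition

summary = pd.DataFrame({"coef": coefs, "odds": odds, "pct_change": pct_change})
print(summary.round(3))
```

A negative coefficient gives an odds ratio below 1 (the odds of attrition fall as the feature grows), and a positive one gives an odds ratio above 1.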

Observations:

Model building - Bagging and Boosting

We will use Pipeline to build these models.
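
A minimal sketch of such pipelines, assuming scikit-learn. The choice of a scaler as the preprocessing step, the three classifiers, their settings and the synthetic data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set, with about 16% positives.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1
)

models = {
    "bagging": BaggingClassifier(n_estimators=20, random_state=1),
    "adaboost": AdaBoostClassifier(n_estimators=20, random_state=1),
    "gbm": GradientBoostingClassifier(n_estimators=20, random_state=1),
}

results = {}
for name, model in models.items():
    # Preprocessing and model are fitted together inside each CV fold.
    pipe = Pipeline([("scaler", StandardScaler()), (name, model)])
    results[name] = cross_val_score(pipe, X, y, scoring="recall", cv=5).mean()

for name, score in results.items():
    print(f"{name}: mean CV recall = {score:.3f}")
```

Wrapping the preprocessing in the pipeline means it is re-fitted on each training fold, which keeps the cross-validation estimate honest.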

Grid Search Tuning- Extreme Gradient Boost

Observations:

  1. XGB Tuned (Grid Search CV) is not overfitting.
  2. The recall on training and test set is really good.
  3. Precision on training set is good but slightly lower on test set.
  4. XGB Grid Search took 1h 8min 28s to run and search the best param values.

Grid Search Tuning- Ada Boost

Observations:

  1. AdaBoost Tuned (Grid Search CV) is not overfitting.
  2. The recall on the training set is good but slightly lower on the test set.
  3. Precision on the training set and test set is good.
  4. The AdaBoost classifier grid search took 1min 13s to run.
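
A minimal sketch of grid search tuning for an AdaBoost pipeline, assuming scikit-learn's `GridSearchCV` with recall as the scoring metric. The parameter grid, step names and synthetic data are hypothetical and much smaller than a realistic search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set, with about 16% positives.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84, 0.16], random_state=1
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ada", AdaBoostClassifier(random_state=1)),
])

# Step name + double underscore addresses a parameter inside the pipeline.
param_grid = {
    "ada__n_estimators": [10, 30],
    "ada__learning_rate": [0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, scoring="recall", cv=3)
grid.fit(X, y)

print("combinations tried:", len(grid.cv_results_["params"]))
print("best params:", grid.best_params_)
print("best CV recall:", round(grid.best_score_, 3))
```

Grid search fits every combination in the grid for every fold, which is why its run time grows quickly as parameters are added.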

Grid Search Tuning- Gradient Boost

Observations:

  1. GB Tuned (Grid Search CV) is not overfitting.
  2. The recall on the training and test sets is good but lower than for AdaBoost Tuned (Grid Search CV) and XGBoost Tuned (Grid Search CV).
  3. Precision on the training set is good but slightly lower on the test set.
  4. The Gradient Boost classifier grid search took 7min 49s to run.

Random Search CV- Extreme Gradient Boost

Observations:

  1. XGB Tuned (Random Search CV) is not overfitting.
  2. Accuracy on the train and test sets is better than with Grid Search CV.
  3. The recall on the training and test sets is really good and comparable to Grid Search.
  4. Precision on the training set is low, and also lower than what we got with Grid Search.
  5. Random Search CV on XGB took 1min 46s to run.

Random Search CV- AdaBoost Classifier

Observations:

  1. Accuracy on the training set and test set is pretty good, which shows that the model is not overfitting. This is comparable to what we saw for the AdaBoost model tuned with grid search.
  2. Recall on the train set is slightly lower, but better on the test set, when compared to the AdaBoost model tuned with grid search.
  3. Precision is also comparable to the AdaBoost model tuned with grid search.
  4. Random Search CV on the AdaBoost classifier took 2min 10s to run.
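
A minimal sketch of random search tuning, assuming scikit-learn's `RandomizedSearchCV`. Unlike grid search, it evaluates only `n_iter` sampled combinations, so its cost is set by `n_iter` rather than by the size of the search space. The search space, step names and synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set, with about 16% positives.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84, 0.16], random_state=1
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ada", AdaBoostClassifier(random_state=1)),
])

# 5 x 4 = 20 possible combinations, of which only n_iter are tried.
param_distributions = {
    "ada__n_estimators": list(range(10, 60, 10)),
    "ada__learning_rate": [0.05, 0.1, 0.5, 1.0],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=5,
    scoring="recall",
    cv=3,
    random_state=1,
)
search.fit(X, y)

print("combinations tried:", len(search.cv_results_["params"]))
print("best params:", search.best_params_)
print("best CV recall:", round(search.best_score_, 3))
```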

Random Search CV- Gradient Boost Classifier

Observations:

  1. Accuracy, Recall and Precision on the train and test sets are all better than for the GB classifier tuned with Grid Search.
  2. The model is not overfitting.

Comparing all models

Observations:

  1. The top 2 features for both XGB (Grid Search) and XGB (Random Search) are Total_Trans_Ct and Total_Revolving_Bal.
  2. Months_Inactive_12_mon is the third most important feature for XGB (Random Search) but only in 9th position for the XGB (Grid Search) tuned model.
  3. Total_Relationship_Count, Total_Trans_Amt, Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 are other important features to look out for.

Business Recommendations:

The bank's objective is to target the customers who might renounce their credit card and improve services for them, so that they do not leave. To do that:
  1. The bank should target customers who have a low transaction count or a low transaction amount in the last 12 months. It can reach out with proactive surveys and offer loyalty points or discounts per transaction, so that customers have an incentive to use the credit card.

  2. The bank should target customers who have a lower revolving balance, which indicates that they are either not using the credit card very frequently or using it only for small transactions. Again, the bank can offer loyalty points for high-value transactions. It can also give customers an incentive to transfer the balance of another credit card (possibly from another bank), so that they have more at stake with the bank and are more involved.

  3. Since most customers use the "Blue" card, the bank should look at growing the customer base for its Gold, Platinum and Silver credit cards. It can offer sign-up bonus programs for these less preferred cards, such as travel points, so that they are lucrative to customers.

  4. Since most of the bank's customers are young people who are Graduates and make less than 40K per year, it seems that the bank needs more options for this clientele. It can offer credit limits in a low range, as well as a lower or no annual fee for a certain time, to attract more customers.

  5. Since the bank has a lot of College students and High Schoolers as customers, it can offer special discounts or points when these customers (students) make purchases for educational purposes, such as electronics (laptops, computers), online education portals and books, so that they use the credit card more frequently.

  6. Since most of the bank's customers are Married Graduates who earn less than 40K per year and own the "Blue" credit card, the main aim of this group would be to make the most of the credit card, either by saving money when using it or by accumulating points for their purchases. The bank should look at lucrative ways to engage, attract and retain such customers, for example loyalty points for transactions and small incentives when the credit balance is paid early, so that there is a reason for them to stay. The bank can also engage with the merchants where its credit card is used most and see whether they can collaborate to provide incentives to customers.